ABSTRACT

Self help systems are increasingly deployed by service based industries because they can deliver better customer service, and the shift is increasingly towards voice based self help systems because they provide a natural interface for a human to interact with a machine. A speech based self help system ideally needs a speech recognition engine to convert spoken speech to text and, in addition, a language processing engine to take care of any misrecognitions by the speech recognition engine. An off-the-shelf speech recognition engine is generally a combination of acoustic processing and speech grammar. While this is the norm, we believe that a speech recognition application should ideally have, in addition to the speech recognition engine, a separate language processing engine to give the system better performance. In this paper, we discuss ways in which the speech recognition engine and the language processing engine can be combined to give a better user experience.

Index Terms: User Interface, Self help solution, Spoken Language System, Speech Recognition, Natural Language

1. INTRODUCTION 

Self help solutions are being increasingly deployed by many service oriented industries, essentially to serve their customer base any time, anywhere. Technologies based on artificial intelligence are being used to develop self help solutions. Typically, self help solutions are web based or voice based. Of late, voice based self help solutions have been gaining popularity because of the ease with which they can be used, fueled by significant developments in speech technology. Speech recognition engines are being increasingly used in several applications with varying degrees of success. Businesses invest in speech for several reasons. Significant among them is the time to return on investment (ROI), which for speech recognition solutions is typically 9 to 12 months and in many cases as little as 3 months [?]. In addition, speech solutions are economical and effective in improving customer satisfaction, operations and workforce productivity. Password resetting using speech [?], airline enquiry, talking yellow pages [?] and, more recently, contact centers are some of the areas where speech solutions have been demonstrated and used.

The improved performance of speech solutions can be attributed to several factors. For example, work in dialog design and language processing has enhanced the performance of speech solutions to the point of making them deployable, and people have also become more comfortable using voice as an interface to transact. The performance of a speech engine rests primarily on two aspects, namely, (a) the acoustic performance and (b) the non-acoustic performance. While the acoustic performance of speech engines has improved only moderately, the mature use of non-acoustic aspects has made speech engines usable in applications; together, the two enable a good user experience.

Any speech based solution requires a spoken speech signal to be converted into text, and this text is further processed to derive some form of information from an electronic database (in most practical systems). Converting the spoken speech into text is broadly the domain of the speech recognition engine, while the latter step, converting the text into a meaningful text string that a machine can process, is the domain of natural language (NL) processing. In the literature, the line demarcating the two is very thin and usually fuzzy, because language processing is also done in the form of speech grammar within a speech recognition engine. This has been noted by Pieraccini et al. [?], who discuss the complexity involved in integrating language constraints into a large vocabulary speech recognition system. They propose to have a limited language capability in the speech recognizer and to transfer the complete language capability to a post processing unit. Young et al. [?] describe a system which combines natural language processing with speech understanding in the context of a problem solving dialog, while Dirk et al. [?] suggest the use of a language model to integrate speech recognition with semantic analysis. The MIT Voyager speech understanding system [?] interacts with the user through spoken dialog, and its authors describe their attempts at integrating the speech recognition and natural language components. Combining statistical and knowledge based approaches to spoken language in order to enhance speech based solutions is discussed in [?].

In this paper, we describe a non-dialog based self help system, meaning that a query by the user is answered with a single response by the system; there is no interactive session between the machine and the user (as in a dialog based system). The idea is that the user queries and the system responds with an answer, assuming that the query is complete in the sense that an answer can be fetched. In the event the query is incomplete, the natural language (NL) processing engine responds with a close answer by making a few assumptions when required [9]. The NL engine additionally corrects any possible misrecognition by the speech engine. In this paper, we make no effort to distinguish the differences in the way language is processed in the speech recognition module and in the natural language processing module. We argue that it is optimal (in the sense of the performance of the speech solution plus user experience) to use a combination of language models in the speech recognition and natural language processing modules. We first show (Section 2) that language processing has to be distributed and cannot be limited to either the speech recognition or the natural language processing engine. We then show, using a readily available speech recognition (SR) engine and an NL engine [?], how the distribution of language processing helps in providing a better user experience. We describe a speech based self help system in Section 3 and the user experiences in Section 4. We conclude in Section 5.

2. BACKGROUND 

For any speech based solution to work in the field, two parameters are important, namely, the accuracy of the speech recognition engine and the overall user experience. While the two are not entirely independent, it is useful to treat them as independent in order to understand the performance of the speech solution and the user experience associated with it. User experience is measured by the freedom the system gives the user in terms of (a) who can speak (speaker independence), (b) what can be spoken (large vocabulary) and (c) how to speak (restricted or free speech), while speech recognition accuracy is measured as the ability of the speech engine to convert the spoken speech into the exact text.

Let $\mathbb{F}$ represent the speech recognition engine and let $\mathbb{E}$ be the natural language processing engine. Observe that $\mathbb{F}$ converts the acoustic signal, a time sequence, into a string sequence (a string of words),

$$\mathbb{F} : \text{time sequence} \rightarrow \text{string sequence}$$

while $\mathbb{E}$ processes a string sequence to generate another string sequence,

$$\mathbb{E} : \text{string sequence} \rightarrow \text{string sequence}$$

Let $q_t$ represent the spoken query corresponding to, say, the string query $q_s$ ($q_t$ can be considered the read version of the written string of words $q_s$). Then the operations of the speech and the natural language processing engines can be represented as

$$\begin{aligned} \mathbb{F}(q_t) &= q_{s'} && \text{(speech engine)} \\ \mathbb{E}(q_{s'}) &= q_{s''} && \text{(NL processing)} \end{aligned} \tag{1}$$

The speech recognition engine $\mathbb{F}$ uses acoustic models (usually hidden Markov model based) and a language grammar, which are tightly coupled, to convert $q_t$ to $q_{s'}$, while the natural language engine $\mathbb{E}$ operates on $q_{s'}$ and uses only a statistical or knowledge based language grammar to convert it into $q_{s''}$. Language processing thus happens in both $\mathbb{F}$ and $\mathbb{E}$; the only difference is that the language processing in $\mathbb{F}$ is tightly coupled with the overall functioning of the speech recognition engine, unlike in $\mathbb{E}$. Because the grammar used in $\mathbb{F}$ (speech to text) is tightly coupled with the acoustic models, its degree of configurability is very limited; at the same time, language processing in $\mathbb{F}$ is necessary to achieve reasonable speech recognition performance. In contrast, a relatively high degree of configurability is possible in $\mathbb{E} : q_{s'} \rightarrow q_{s''}$ (text to text). The aim of any speech based solution is to build $\mathbb{F}$ and $\mathbb{E}$ such that their combined effort, $\mathbb{E}(\mathbb{F}(q_t)) = q_{s''}$, satisfies $q_{s''} \approx q_s$. Do we need language processing in both $\mathbb{F}$ and $\mathbb{E}$, or is it sufficient to (a) isolate $\mathbb{F}$ and $\mathbb{E}$ and have language processing only in $\mathbb{E}$, or (b) combine all language processing into $\mathbb{F}$ and do away with $\mathbb{E}$ completely? Probably there is an optimal combination of $\mathbb{F}$ and $\mathbb{E}$ which produces a usable speech based solution.
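The composition of the two engines can be sketched in a few lines of Python. This is an illustration only: `toy_recognizer`, `toy_nl_engine` and the confusion and repair tables are hypothetical stand-ins, not the engines used in this work, and the acoustic signal is faked as a list of word tokens rather than audio samples.

```python
from typing import Callable, List

TimeSequence = List[str]    # stand-in for the acoustic signal q_t (assumption)
StringSequence = List[str]  # a string of words

def toy_recognizer(q_t: TimeSequence) -> StringSequence:
    """Stand-in for the speech engine: returns q_s', possibly misrecognized."""
    confusions = {"surrender": "surrounded"}  # hypothetical misrecognition
    return [confusions.get(word, word) for word in q_t]

def toy_nl_engine(q_s1: StringSequence) -> StringSequence:
    """Stand-in for the NL engine: maps q_s' to q_s'' by repairing known errors."""
    repairs = {"surrounded": "surrender"}  # hypothetical domain knowledge
    return [repairs.get(word, word) for word in q_s1]

def combine(
    speech_engine: Callable[[TimeSequence], StringSequence],
    nl_engine: Callable[[StringSequence], StringSequence],
) -> Callable[[TimeSequence], StringSequence]:
    """Build the combined system: first the speech engine, then the NL engine."""
    return lambda q_t: nl_engine(speech_engine(q_t))

system = combine(toy_recognizer, toy_nl_engine)
q_s = "what is the surrender value".split()
q_s2 = system(q_s)  # the combined output q_s''
```

In this toy setting the recognizer alone returns a corrupted string, while the combined system recovers the intended query, which is the behaviour the combined system is meant to achieve.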

An ideal speech recognition system should be able to convert $q_t$ into the exact query string $q_s$. Assume that there are three different types of speech recognition engines. Let the speech recognition engine $\mathbb{F}_1$ allow any user to speak anything (a speaker independent dictation system); let $\mathbb{F}_2$ be $\mathbb{F}_1$ with its performance tuned to a particular person (person dependent); and let $\mathbb{F}_3$ be $\mathbb{F}_2$ additionally constrained so that the user may speak only from within a restricted grammar. Clearly, the user experience is best for $\mathbb{F}_1(x_t) = x^1_{s'}$, worst for $\mathbb{F}_3(x_t) = x^3_{s'}$, and in between for $\mathbb{F}_2(x_t) = x^2_{s'}$.

Let $d(x_s, y_s)$ be the distance between the strings $x_s$ and $y_s$. Clearly, $d(x^1_{s'}, x_s) > d(x^2_{s'}, x_s) > d(x^3_{s'}, x_s)$; that is, the performance of the speech engine is best for $\mathbb{F}_3$, followed by $\mathbb{F}_2$, followed by $\mathbb{F}_1$. Observe that in terms of user experience the order is reversed. For the overall speech system to perform well, the contribution of $\mathbb{E}$ has to vary: $\mathbb{E}$ should be able to generate $q^1_{s''}$, $q^2_{s''}$ and $q^3_{s''}$ from $q^1_{s'}$, $q^2_{s'}$ and $q^3_{s'}$ respectively, so that $d(q^1_{s''}, q_s) \approx d(q^2_{s''}, q_s) \approx d(q^3_{s''}, q_s) \approx 0$. The performance of $\mathbb{E}$ has to be better where the performance of $\mathbb{F}$ is poorer; for example, $\mathbb{E}_1$ has to perform better than $\mathbb{E}_3$ to compensate for the poor performance of $\mathbb{F}_1$ compared to $\mathbb{F}_3$.


Typically, an $\mathbb{F}_1$ (ideal user experience) speech recognition system would be characterized by (a) open speech (free speech, spoken without constraints), (b) speaker independence (different accents, dialects, ages, genders) and (c) environment independence (office, public telephone). While (a) depends greatly on the language model used in the speech recognition system, both (b) and (c) depend on the acoustic models in the speech recognition engine. For an $\mathbb{F}_1$ type of system, the user experience is good but the speech recognition accuracy is poor. On the other hand, a typical $\mathbb{F}_3$ (poor user experience) system would be characterized by a limited domain of operation, and the system would be tuned (in other words, constrained) to make use of prior information on the expected types of queries.

In the next section we describe a voice based self help system which enables us to tune the language grammar and hence control the performance of the speech recognition engine.


3. VOICE BASED SELF HELP SYSTEM 



Fig. 1. Block Diagram of a Self Help System 


A voice based self help system is a speech enabled solution which enables human users to interact with a machine using their speech to carry out a transaction. To better understand the role of $\mathbb{F}$ and $\mathbb{E}$ in a speech solution, we built a voice based self help system. The system was built using the Microsoft speech recognition engine ($\mathbb{F}$), accessed through the Microsoft SAPI SDK [10], and a language processing module ($\mathbb{E}$) developed in-house [?].

In general, insurance agents act as intermediaries between the insurance company (the service provider) and their clients (the actual insurance seekers). Usually, insurance agents keep track of information about their clients (policy status, maturity status, change of address requests, among other things) by being in touch with the insurance company. In the absence of a self help system, the agents obtained this information by speaking to live agents at a call center run by the insurance company. The reason for building a self help system was to enable the insurance company to lower call center usage and, additionally, to provide the dynamic information needed by agents; together these provide better customer service. The automated self help system answers the queries of an insurance agent. Figure 1 shows a high level functional block representation of the self help system. The user (represented as a mobile phone in Figure 1) calls a predetermined number and speaks a query to get information. The speech recognition engine converts the spoken query (speech signal) into text; this text is operated upon by the natural language processing block. The processed string is then used to fetch an answer from the database. The response to the query, which is a text string, is (a) spoken out to the user using a text to speech engine and (b) alternatively sent to the user as an SMS. We used Kannel, an open source WAP and SMS gateway, to send the answer string as an SMS [11].

The self help system was designed to cater to

1. different kinds of information sought by an insurance agent

(a) on behalf of their clients (example: What is the maturity value of the policy TRS1027465?) and

(b) for themselves (example: When was my last commission paid?),

2. different accents,

3. queries of differing complexity,

4. natural English queries, and

5. different ways in which the same query can be asked. For example,

(a) Surrender value of policy TRS1027465?

(b) What is the surrender value of policy TRS1027465?

(c) Can you tell me surrender value of policy TRS1027465?

(d) Please let me know the surrender value of policy TRS1027465?

(e) Please tell me surrender value of policy TRS1027465?

(f) Tell me surrender value of policy TRS1027465?

(g) My policy is TRS1027465. What is its surrender value?

should all be understood as querying for the

... surrender value of the policy TRS1027465 ....
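The requirement that differently worded queries map to the same request can be illustrated with a small key concept spotter. This is a minimal sketch and not the in-house language processing module: the concept list, the policy number pattern (inferred only from the single example TRS1027465) and the function name `normalize` are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical subset of the key concepts an agent may ask about.
KEY_CONCEPTS = ("surrender value", "maturity value")

# Assumed policy number format: three capital letters, optional space, seven digits.
POLICY_RE = re.compile(r"\b([A-Z]{3})\s?(\d{7})\b")

def normalize(query: str) -> Optional[Tuple[str, str]]:
    """Reduce a free-form query to (key concept, policy number), or None."""
    text = query.lower()
    concept = next((c for c in KEY_CONCEPTS if c in text), None)
    match = POLICY_RE.search(query)
    if concept is None or match is None:
        return None  # incomplete query: no answer can be fetched from it
    return concept, match.group(1) + match.group(2)

variants = [
    "Surrender value of policy TRS1027465?",
    "What is the surrender value of policy TRS1027465?",
    "Can you tell me surrender value of policy TRS1027465?",
    "Please let me know the surrender value of policy TRS1027465?",
    "Please tell me surrender value of policy TRS1027465?",
    "Tell me surrender value of policy TRS 1027465?",
    "My policy is TRS1027465. What is its surrender value?",
]
canonical = {normalize(v) for v in variants}
```

All seven phrasings reduce to the same pair, which is the form in which a database lookup would be issued. A query that lacks either the concept or the policy number yields `None` in this sketch, whereas the system described here would instead try to respond with a close answer.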

Note that the performance of the speech recognition engine is controlled by the speech grammar that drives it (see Figures 2, 3 and 4 for examples of speech grammar). The speech engine uses the speech grammar when converting the spoken acoustic signal into a text string. In Section 4 we show how the performance of the speech engine can be controlled by varying the speech grammar.

4. EXPERIMENTAL RESULTS 

We built three versions of the self help system with varying degrees of processing distributed between $\mathbb{F}$ and $\mathbb{E}$, namely,

1. $\mathbb{F}_1$ has no grammar (Figure 2), giving the user a very high degree of freedom as to what they can ask, including scope to ask invalid queries;

2. $\mathbb{F}_2$ (Figure 3) has a liberal grammar, with more processing in $\mathbb{E}$; and

3. $\mathbb{F}_3$ has a constrained grammar (see Figure 4) which constrains what the user can say.

For example, an $\mathbb{F}_1$ grammar would validate even an out of domain query like "What does this system do?" on the one hand, and an incorrect query like "What is last paid commission address change?" on the other. At the other extreme, an $\mathbb{F}_3$ grammar would only recognize queries like "What is the surrender value of Policy number" or "Can you please tell me the maturity value of Policy number" and so on. Note that the constrained grammar $\mathbb{F}_3$ gives very accurate speech recognition because the speaker speaks what the speech engine expects; this in turn puts very little processing load on $\mathbb{E}$.
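The difference between the unconstrained, liberal and constrained grammars can be mimicked on text with regular expressions. The sketch below is only an analogy: a real speech grammar constrains the recognizer during decoding, whereas these patterns filter text after the fact, and the concept list and carrier phrases are a hypothetical subset chosen from the examples above.

```python
import re
from typing import List

CONCEPTS = r"(surrender value|maturity value|address change)"  # assumed subset
KEYWORD = r"policy number"

# No grammar: any utterance is accepted.
F1 = re.compile(r".*", re.IGNORECASE | re.DOTALL)

# Liberal grammar: a key concept followed by a keyword, with arbitrary
# "do not care" words before, between and after them.
F2 = re.compile(rf".*\b{CONCEPTS}\b.*\b{KEYWORD}\b.*", re.IGNORECASE)

# Constrained grammar: only fixed carrier phrases are accepted.
F3 = re.compile(
    rf"(what is the|can you please tell me the) {CONCEPTS} of {KEYWORD}",
    re.IGNORECASE,
)

def accepted_by(query: str) -> List[str]:
    """Return the names of the toy grammars that accept the query."""
    cleaned = query.strip().rstrip("?").strip()
    grammars = (("F1", F1), ("F2", F2), ("F3", F3))
    return [name for name, pattern in grammars if pattern.fullmatch(cleaned)]

out_of_domain = accepted_by("What does this system do?")
loosely_worded = accepted_by("Please tell me the maturity value for my policy number")
templated = accepted_by("What is the surrender value of Policy number")
```

The out of domain query passes only the unconstrained pattern, the loosely worded query passes the unconstrained and liberal patterns, and the templated query passes all three, which mirrors the trade-off between user freedom and recognizer constraint.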

For the experimental setup, the $\mathbb{F}_1$ grammar generated a total of 27 possible queries, of which only 3 were not answered by the combined $\mathbb{F}_1$, $\mathbb{E}$ system. On the other hand, a grammar of type $\mathbb{F}_3$ gave a total of 357 different queries that the user could possibly ask (a very high degree of flexibility to the user). Of these, only 212 were valid in the sense that they were meaningful and could be answered by the $\mathbb{F}_3$, $\mathbb{E}$ system; the remaining 145 were processed by $\mathbb{E}$ but were not meaningful, and hence no answer was provided. The $\mathbb{F}_2$ grammar fell between these two cases, producing a total of 76 possible queries that the user could ask, of which 20 were invalidated by the combined $\mathbb{F}_2$, $\mathbb{E}$ system.
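The counts above can be turned into the fraction of possible queries that each combined system answered. The short calculation below uses only the figures reported in the text, keyed by the grammar labels as they appear there; the dictionary layout and function name are illustrative.

```python
# Query counts as reported in the text: total possible queries and the
# number that the combined system did not answer.
reported = {
    "F1": {"total": 27, "unanswered": 3},
    "F2": {"total": 76, "unanswered": 20},
    "F3": {"total": 357, "unanswered": 145},
}

def answered_fraction(total: int, unanswered: int) -> float:
    """Fraction of the possible queries for which an answer was provided."""
    return (total - unanswered) / total

fractions = {
    name: answered_fraction(counts["total"], counts["unanswered"])
    for name, counts in reported.items()
}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} of possible queries answered")
```

The grammar with the largest set of possible queries has the lowest answered fraction, and the grammar with the smallest set has the highest, with the third grammar in between.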

<GRAMMAR>
  <RULE NAME="F_1" TOPLEVEL="ACTIVE">
    <RULEREF NAME="DonotCare"/>
  </RULE>
</GRAMMAR>

Fig. 2. $\mathbb{F}_1$: No grammar; the speaker can speak anything.


<GRAMMAR>
  <RULE NAME="F_2" TOPLEVEL="ACTIVE">
    <RULEREF NAME="DonotCare"/>
    <RULEREF NAME="KeyConcept"/>
    <RULEREF NAME="DonotCare"/>
    <RULEREF NAME="KeyWord"/>
    <RULEREF NAME="DonotCare"/>
  </RULE>
  <RULE NAME="KeyConcept">
    <P> Surrender Value </P>
    <P> Maturity Value </P>
    <P> ... </P>
    <P> Address Change </P>
  </RULE>
  <RULE NAME="KeyWord">
    <P> Policy Number </P>
    <P> ... </P>
    <P> ... </P>
  </RULE>
</GRAMMAR>

Fig. 3. $\mathbb{F}_2$: Liberal grammar; some restriction on the user.


5. CONCLUSIONS

The performance of a voice based self help solution has two components: the user experience and the performance of the speech engine in converting the spoken speech into text. We showed that $\mathbb{F}$ and $\mathbb{E}$ can be used jointly to build self help solutions that differ in their effect on the user experience and on the performance of the speech engine. Further, we showed that, on the one hand, by controlling the language grammar one could provide a better user experience, but the performance of the speech recognition became poor; on the other hand, when the grammar was such that the performance of the speech engine was good, the user experience became poor. This shows that there is a balance between speech recognition accuracy and user experience that designers of voice based self help systems must maintain, so that the speech recognition accuracy is good without sacrificing the user experience.